AMENDMENTS TO THE CLAIMS

The Assignee submits below a complete listing of the current claims, including marked-up 
claims with insertions indicated by underlining and deletions indicated by strikeouts and/or double 
bracketing. This listing of claims replaces all prior versions and listings of claims in the application: 

1. (Currently amended) A program storage device readable by a machine, tangibly
embodying a program of instructions executable by the machine to perform a method for speech 
synthesis that allows user specified pronunciations, the method comprising: 

extracting prosodic parameter[[s]] values from an audio signal corresponding to a 
pronunciation of a text string by a user; 

extracting duration parameter[[s]] values from the audio signal by aligning the audio signal 
with the text string; 

[[automatically generating at least one text to speech (TTS) input using the prosodic
parameters and the duration parameters, wherein the at least one TTS input is formatted for use in
synthesizing speech from the text string; and]]

adopting as synthesis parameter values the prosodic parameter values and duration 
parameter values extracted from the audio signal; and 

generating a synthetic speech waveform using the [[at least one TTS input]] synthesis parameter
values.

2-3. (Canceled) 

4. (Previously Presented) The program storage device of claim 1, wherein the instructions for
aligning comprise instructions for segmenting the audio signal into time-segmented regions, 
wherein each time-segmented region is mapped to a corresponding phoneme. 

5. (Previously Presented) The program storage device of claim 1, wherein the alignment is 
performed using a Viterbi alignment process. 

6. (Canceled)

7. (Currently amended) The program storage device of claim 1, wherein the instructions for
automatically generating [[at least one TTS input]] the synthetic speech waveform comprise
instructions for directly specifying at least one portion of the [[duration parameters and/or the
prosodic]] synthesis parameter[[s]] values as attribute values for mark-up elements.

8. (Currently amended) The program storage device of claim 1, wherein the instructions for
automatically generating [[at least one TTS input]] the synthetic speech waveform comprise
instructions for assigning abstract labels to at least one portion of the [[duration parameters and/or the
prosodic]] synthesis parameter[[s]] values to generate a high-level markup of the text string.

9. (Currently amended) The program storage device of claim 1, wherein the [[at least one TTS
input is generated]] generating comprises generating a markup of the text string using SSML (speech
synthesis markup language).

10. (Previously Presented) The program storage device of claim 1, further comprising 
instructions for processing phonetic content of the audio signal to generate the synthetic speech 
waveform having a desired pronunciation. 

11. (Currently amended) A method for speech synthesis that allows user specified
pronunciations, the method comprising: 

extracting prosodic parameter[[s]] values from an audio signal corresponding to a 
pronunciation of a text string by a user; 

extracting duration parameter[[s]] values from the audio signal by aligning the audio signal 
with the text string; 

[[automatically generating at least one text to speech (TTS) input using the prosodic
parameters and the duration parameters, wherein the at least one TTS input is formatted for use in
synthesizing speech from the text string; and]]

adopting as synthesis parameter values the prosodic parameter values and duration 
parameter values extracted from the audio signal; and

generating a synthetic speech waveform using the [[at least one TTS input]] synthesis parameter
values.

12-13. (Canceled) 

14. (Previously Presented) The method of claim 11, wherein aligning comprises extracting
acoustic feature data from the audio signal and time-aligning the audio signal to the text string 
using the acoustic feature data. 

15. (Previously Presented) The method of claim 11, wherein aligning is performed using a
Viterbi alignment process. 

16. (Canceled) 

17. (Currently amended) The method of claim 11, wherein automatically generating [[at least one
TTS input]] the synthetic speech waveform comprises directly specifying at least one portion of the
[[duration parameters and/or the prosodic]] synthesis parameter[[s]] values as attribute values for
mark-up elements.

18. (Currently amended) The method of claim 11, wherein automatically generating [[at least one
TTS input]] the synthetic speech waveform comprises assigning abstract labels to at least one portion
of the [[duration parameters and/or the prosodic]] synthesis parameter[[s]] values to generate a
high-level markup of the text string.

19. (Currently amended) The method of claim 11, wherein the [[at least one TTS input is
generated]] generating comprises generating a markup of the text string using SSML (speech
synthesis markup language).

20. (Previously Presented) The method of claim 11, further comprising processing phonetic
content of the audio signal to generate the synthetic speech waveform having a desired 
pronunciation. 

21. (Currently amended) A text-to-speech (TTS) system that allows user specified
pronunciations, comprising: 

a prosody analyzer for determining [[prosodic]] synthesis parameter[[s]] [[of]] values from an
audio signal corresponding to a pronunciation by a user of an input text string [[and automatically
generating at least one TTS system input corresponding to the audio signal using the prosodic
parameters]], wherein the prosody analyzer comprises:

a prosodic parameter extraction module for extracting [[the]] prosodic
parameter[[s]] values from the audio signal,

an alignment module for extracting duration parameter[[s]] values from the
audio signal by aligning the input text string with the audio signal, and 

a conversion module for [[generating the at least one TTS system input using
the prosodic parameters and the duration parameters]] adopting as synthesis parameter
values the prosodic parameter values and duration parameter values extracted from
the audio signal; and

a TTS system engine for generating a synthetic speech waveform using the [[at least one TTS
system input]] synthesis parameter values.

22. (Previously Presented) The system of claim 21, further comprising a user interface that 
enables a user to input the audio signal and the input text string corresponding to the audio signal. 

23. (Previously Presented) The system of claim 21, wherein the prosody analyzer processes
phonetic content of the audio signal to generate the synthetic waveform having a desired 
pronunciation. 

24-28. (Canceled) 

29. (Previously Presented) The program storage device of claim 33, wherein extracting acoustic 
feature data from the audio signal comprises digitizing the audio signal into a set of frames and 
transforming the digitized audio signal into a set of feature vectors on a frame-by-frame basis. 

30. (Previously Presented) The program storage device of claim 29, wherein transforming the 
digitized audio signal comprises producing a 24-dimensional cepstra feature vector for every 10ms 
of the audio signal, concatenating frames to the left and to the right of a current frame to augment a 
current cepstral vector, and reducing each augmented cepstral vector to a 60-dimensional feature 
vector using linear discriminant analysis. 

31. (Previously Presented) The method of claim 34, wherein extracting acoustic feature data
from the audio signal comprises digitizing the audio signal into a set of frames and transforming the 
digitized audio signal into a set of feature vectors on a frame-by-frame basis. 

32. (Previously Presented) The method of claim 31, wherein transforming the digitized audio
signal comprises producing a 24-dimensional cepstra feature vector for every 10ms of the audio 
signal, concatenating frames to the left and to the right of a current frame to augment a current 
cepstral vector, and reducing each augmented cepstral vector to a 60-dimensional feature vector 
using linear discriminant analysis. 

33. (Previously Presented) The program storage device of claim 1 wherein the method further 
comprises extracting acoustic feature data from the audio signal and wherein the aligning further 
comprises outputting a set of duration contours. 

34. (Previously Presented) The method of claim 11 further comprising extracting acoustic
feature data from the audio signal and wherein the aligning further comprises outputting a set of 
duration contours. 

35. (Previously Presented) The system of claim 21 wherein the prosody analyzer further 
comprises an acoustic feature extraction module that extracts acoustic feature data from the audio 
signal and wherein the alignment module uses said acoustic feature data to perform the aligning. 